Incremental ETL Pipeline Scheduling for Near Real-Time Data Warehouses

Authors

Weiping Qu

Stefan Deßloch

Abstract

We present our work on an incremental ETL pipeline for on-demand data warehouse maintenance. Pipeline parallelism is exploited to concurrently execute a chain of maintenance jobs. Each job takes a batch of delta tuples extracted from source-local transactions whose commit timestamps precede the arrival time of an incoming warehouse query, and calculates final deltas to bring the relevant warehouse tables up to date. Each pipeline operator runs in a single, non-terminating thread, processes one job at a time, and re-initializes itself for each new job. However, to continuously perform incremental joins or maintain slowly changing dimension (SCD) tables, the same staging tables or dimension tables can be concurrently accessed and updated by distinct pipeline operators that work on different jobs. Without proper thread coordination, inconsistencies can arise. In this paper, we propose two types of consistency zones, for SCD and for incremental join, to address this problem. In addition, we review existing pipeline scheduling algorithms in our incremental ETL pipeline with consistency zones.
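The operator model described in the abstract (one thread per pipeline operator, one job at a time, state reset between jobs) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `operator`, `run_pipeline`, and `STOP` are hypothetical, the queues stand in for whatever hand-off mechanism the real pipeline uses, and the shutdown sentinel exists only so the example terminates, whereas the paper's threads are non-terminating. The consistency zones themselves are not modeled here.

```python
import queue
import threading

STOP = object()  # hypothetical shutdown sentinel; the paper's threads never terminate


def operator(transform, inbox, outbox):
    """One pipeline stage: take one job at a time, process it, pass it downstream."""
    while True:
        job = inbox.get()
        if job is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            return
        job_id, deltas = job
        state = []  # per-job state, re-initialized for every new job
        for row in deltas:
            state.append(transform(row))
        outbox.put((job_id, state))


def run_pipeline(jobs, transforms):
    """Chain one thread per transform; jobs flow through FIFO queues in order."""
    queues = [queue.Queue() for _ in range(len(transforms) + 1)]
    threads = [
        threading.Thread(target=operator, args=(t, queues[i], queues[i + 1]))
        for i, t in enumerate(transforms)
    ]
    for th in threads:
        th.start()
    for job in jobs:
        queues[0].put(job)
    queues[0].put(STOP)

    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        results.append(item)
    for th in threads:
        th.join()
    return results
```

Because every stage is a single thread reading from a FIFO queue, jobs leave the pipeline in the order they entered, while different stages can work on different jobs at the same time. That overlap is what creates the concurrent access to shared staging and dimension tables that the abstract's consistency zones are meant to coordinate.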

Similar resources

Real-Time Snapshot Maintenance with Incremental ETL Pipelines in Data Warehouses

Multi-version concurrency control method has nowadays been widely used in data warehouses to provide OLAP queries and ETL maintenance flows with concurrent access. A snapshot is taken on existing warehouse tables to answer a certain query independently of concurrent updates. In this work, we extend this snapshot with the deltas which reside at the source side of ETL flows. Before answering a qu...

Near-real-time Parallel Etl+q for Automatic Scalability in Bigdata

In this paper we investigate the problem of providing scalability to near-real-time ETL+Q (Extract, transform, load and querying) process of data warehouses. In general, data loading, transformation and integration are heavy tasks that are performed only periodically during small fixed time windows. We propose an approach to enable the automatic scalability and freshness of any data warehouse a...

Next-generation ETL Framework to Address the Challenges Posed by Big Data

The specific features of Big Data i.e., variety, volume, and velocity call for special measures to create ETL data pipelines and data warehouses. A rapidly growing need for analyzing Big Data calls for novel architectures for warehousing the data, such as data lakes or polystores. In both of the architectures, ETL processes serve similar purposes as in traditional data warehouse architectures. ...

Efficient ETL+Q for Automatic Scalability in Big or Small Data Scenarios

In this paper, we investigate the problem of providing scalability to data Extraction, Transformation, Load and Querying (ETL+Q) process of data warehouses. In general, data loading, transformation and integration are heavy tasks that are performed only periodically. Parallel architectures and mechanisms are able to optimize the ETL process by speeding up each part of the pipeline process as mor...

Striving towards Near Real-Time Data Integration for Data Warehouses

The amount of information available to large-scale enterprises is growing rapidly. While operational systems are designed to meet well-specified (short) response time requirements, the focus of data warehouses is generally the strategic analysis of business data integrated from heterogeneous source systems. The decision making process in traditional data warehouse environments is often delayed ...

Publication date: 2017
